LIME walk-through: Part 2¶

Multi-class Classification of Text Documents¶

Previously, we looked at classifying a subset of the 20 newsgroups dataset as being on the topic of either Atheism or Christianity. In this second part of the walk-through, we will see how to use LIME when all of the classes are included.

1: Loading the full dataset and training a new classifier¶

To begin, we will load the full newsgroups dataset from sklearn. To do this we must first re-load our dependencies, as in the first walk-through.

In [1]:
from __future__ import print_function

import lime
import numpy as np
import sklearn
import sklearn.ensemble
import sklearn.feature_extraction.text
import sklearn.metrics

This time we will load the 20newsgroups dataset and keep all classes of article, not only those related to Atheism or Christianity.

In [2]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='train')

print(newsgroups_train.target_names)
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

The names of the target classes for the full dataset are a bit lengthy, so let's trim them to something more readable.

In [3]:
class_names = [x.split('.')[-1] if 'misc' not in x else '.'.join(x.split('.')[-2:]) for x in newsgroups_train.target_names]
class_names[3] = 'pc.hardware'
class_names[4] = 'mac.hardware'
class_names[5] = 'windows.x'

print(', '.join(class_names))
atheism, graphics, ms-windows.misc, pc.hardware, mac.hardware, windows.x, misc.forsale, autos, motorcycles, baseball, hockey, crypt, electronics, med, space, christian, guns, mideast, politics.misc, religion.misc

TASK 01: Describe in your own words what the previous code block is doing to make the class names more readable.

Solution 01:

The first line uses a list comprehension (which is like a fancy for-loop) to consider each element of the list of target names.

  • If the name does not contain the string misc then the string is split at each period and only the final text string is retained.
  • If the string misc is in the name then the string is split in the same way but the final two strings are retained and rejoined, separated by a period.
  • The second and third lines correct the pc.hardware and mac.hardware categories, which both require the final two terms to be retained, but do not contain the misc substring.
  • Finally, the fourth line renames the comp.windows.x class label to windows.x, rather than just x.
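The comprehension's behaviour can be sketched on a few example names (a minimal illustration, separate from the notebook's pipeline):

```python
# Shorten a newsgroup name as the list comprehension does:
# keep the last component, or the last two when 'misc' appears.
def shorten(name):
    parts = name.split('.')
    return '.'.join(parts[-2:]) if 'misc' in name else parts[-1]

print(shorten('alt.atheism'))              # -> atheism
print(shorten('comp.os.ms-windows.misc'))  # -> ms-windows.misc
print(shorten('comp.windows.x'))           # -> x  (later corrected to windows.x)
```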

Now that we have some tidy class names, we will use the TF-IDF vectoriser to again convert the text documents into numeric vectors. Once again, be aware of when fit_transform and transform are being used.
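The distinction matters because the vocabulary must be learned from the training texts only. A minimal pure-Python sketch of the same fit/transform pattern, using simple counts rather than the TF-IDF weighting that TfidfVectorizer actually computes:

```python
# "fit": learn the vocabulary from the training texts only
train = ['the car engine', 'the bible verse']
vocab = sorted({w for doc in train for w in doc.split()})

# "transform": count occurrences of the *training* vocabulary in any text;
# words unseen during fitting ('spark') are simply dropped, as in sklearn
def transform(doc):
    words = doc.split()
    return [words.count(w) for w in vocab]

print(vocab)                       # -> ['bible', 'car', 'engine', 'the', 'verse']
print(transform('the car spark'))  # -> [0, 1, 0, 1, 0]
```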

In [4]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(newsgroups_train.data)
test_vectors = vectorizer.transform(newsgroups_test.data)

Recall that we used a random forest classifier in the previous walk-through. To demonstrate that we can use any classifier with a predict_proba method, we will use a Naive Bayes classifier in this second walk-through.

Once more, it does not matter for this walk-through if you do not know how a Naive Bayes classifier works. You will meet naive Bayes classifiers in the supervised learning course, but for now you can put yourself in the place of a stakeholder wanting an explanation from a "black-box" type model.

In [5]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(alpha=0.01)
nb.fit(train_vectors, newsgroups_train.target)
Out[5]:
MultinomialNB(alpha=0.01)

We can use this fitted classifier to predict the class of each document in the test set and calculate an F-score, as we did previously.

In [6]:
pred = nb.predict(test_vectors)
sklearn.metrics.f1_score(newsgroups_test.target, pred, average='weighted')
Out[6]:
0.9977016387434603

Again, we see that this classifier achieves a very high F-score. This indicates that our classifier is doing very well at classifying the training data, but might also indicate that our classifier is overfitting to this dataset and would not generalise well to other texts.

We will look again at using LIME to explain individual predictions. Our aim here is to understand whether we can expect such excellent predictions to generalise beyond this dataset, or if the classifier is using irrelevant parts of the text to produce its predicted classes.

2: Explaining multi-class predictions using LIME¶

Again, we must import the text explainer from the lime module, create an instance of an explainer and set up a pipeline for prediction.

In [7]:
from lime import lime_text 
from sklearn.pipeline import make_pipeline

c = make_pipeline(vectorizer, nb)

We can then use this pipeline to predict to which class the first test text should belong.

In [8]:
print(c.predict_proba([newsgroups_test.data[0]]).round(3))
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
In [9]:
class_names[7]
Out[9]:
'autos'

The first test text is very clearly classified as belonging to the autos category. We can use a LIME explainer to see what words in the document are causing this very confident classification.
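Reading off the predicted class from a row of probabilities is just an argmax over the 20 entries; a small sketch using the probability row printed above:

```python
# Probability row printed above for the first test text:
# all mass on class 7 ('autos')
probs = [0.0] * 20
probs[7] = 1.0

# The predicted class index is the position of the largest probability
predicted = max(range(len(probs)), key=lambda i: probs[i])
print(predicted)   # -> 7
```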

In [10]:
# Import and create an explainer for this multi-class text classifier
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)

# Use the explainer on the first test text
index = 0 

exp_0 = explainer.explain_instance(
    newsgroups_test.data[index],
    c.predict_proba,
    num_features=6,
    labels=[7])

TASK 02¶

Establish whether this confident prediction is correct and visualise the explanation of this prediction, including the content of the test text.

Comment on whether the words that best explain this prediction are likely to generalise beyond this data set.

SOLUTION 02¶

In [11]:
print('Document id: %d' % index)
print('Predicted class =', class_names[nb.predict(test_vectors[index])[0]])
print('True class: %s' % class_names[newsgroups_test.target[index]])

exp_0.show_in_notebook(text=True)
Document id: 0
Predicted class = autos
True class: autos

The words car and bumper seem sensible for determining that this text is about automobiles and are likely to generalise well. The sender name lerxst may well be a strong indicator of auto texts in this data set but is unlikely to generalise. The remaining explanatory words seem to have little to do with automobiles, and so are very unlikely to generalise to new examples.


We will also consider the test text at index 600, which is classified with less certainty.


TASK 03¶

For the test text at index 600, find the predicted probability of belonging to each class of text.

Find the names of the classes to which the Naive Bayes classifier assigns greater than a 5% predictive probability.

SOLUTION 03¶

In [12]:
class_probs_600 = c.predict_proba([newsgroups_test.data[600]]).round(3)
print("Predicted probability of belonging to each class: \n")
print(class_probs_600[0], "\n")

likely_classes = [i for i, v in enumerate(class_probs_600[0]) if v > 0.05]
print(f"Classes with greater than 5% probability: {likely_classes}.\n")

for i in likely_classes: 
    print(f"Class {i} name: {class_names[i]}.")
Predicted probability of belonging to each class: 

[0.    0.068 0.01  0.    0.    0.001 0.003 0.062 0.005 0.004 0.    0.
 0.003 0.012 0.    0.83  0.    0.    0.    0.   ] 

Classes with greater than 5% probability: [1, 7, 15].

Class 1 name: graphics.
Class 7 name: autos.
Class 15 name: christian.

How can we use LIME to explain these probabilities?

Previously, we used the default value of the labels parameter when generating explanations. This worked well in the binary classification problem, but does not extend readily to a multi-class problem. Now we need to pick a small subset of the classes to explain. Below we will generate explanations for the three most likely class labels.

To do this we can use the same explainer that we created earlier to produce another explanation.
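Rather than hard-coding the label indices, the most likely classes can be read off the probability row; a sketch using the rounded probabilities printed above for test text 600:

```python
# Rounded class probabilities for test text 600, as printed earlier
probs = [0.0, 0.068, 0.01, 0.0, 0.0, 0.001, 0.003, 0.062, 0.005, 0.004,
         0.0, 0.0, 0.003, 0.012, 0.0, 0.83, 0.0, 0.0, 0.0, 0.0]

# Sort class indices by probability, highest first, and keep the top three
top_three = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:3]
print(top_three)   # -> [15, 1, 7]
```

(LIME's explain_instance also accepts a top_labels argument, used in Solution 04, which performs this selection for you.)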

In [13]:
index = 600 

exp_600 = explainer.explain_instance(
    newsgroups_test.data[index],
    c.predict_proba,
    num_features=6,
    labels=[15, 1, 7])

print('Document id: %d' % index)
print('Predicted class =', class_names[nb.predict(test_vectors[index])[0]])
print('True class: %s' % class_names[newsgroups_test.target[index]])
Document id: 600
Predicted class = christian
True class: christian

Once again we have a correctly classified document, but this should not come as a great surprise given our classifier's very high F-score.

We generate explanations for the three most likely classes, each consisting of the six most informative words for that class.

Note that the positive and negative signs are with respect to that particular label; words that count negatively towards the text being from one class might count positively towards the same text belonging to another class.

In [14]:
print('Explanation for class %s' % class_names[15])
print('\n'.join(map(str, exp_600.as_list(label=15))))
print()
print('Explanation for class %s' % class_names[1])
print('\n'.join(map(str, exp_600.as_list(label=1))))
print()
print('Explanation for class %s' % class_names[7])
print('\n'.join(map(str, exp_600.as_list(label=7))))
print()
Explanation for class christian
('gifford', 0.2660537351614391)
('Gifford', 0.26158542377029737)
('mil', -0.10052167814551549)
('navy', -0.09601476932804005)
('Mystery', 0.09267785742668981)
('paradox', 0.0918318898729503)

Explanation for class graphics
('Gifford', -0.08288384320898705)
('gifford', -0.08157409955370769)
('mil', 0.03973130380712043)
('navy', 0.038616769566119535)
('Paradox', -0.03219483379240528)
('dt', 0.03178777061780997)

Explanation for class autos
('gifford', -0.09136499841424854)
('Gifford', -0.08689513738425503)
('oasys', 0.04513584094943511)
('navy', 0.04238749598145842)
('dt', 0.041698072833737665)
('paradox', -0.03787341832831892)

It seems as though the same few words are good explanations for all three likely classes. Perhaps gifford refers to a particular sender or recipient? If so, we might suspect that they talk a lot about Christianity but do not have a lot to say about automobiles or graphics.
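The sign pattern is easiest to see by collecting one word's weight under each label. The values below are copied by hand from the explanations printed above (rounded to three decimal places); this is a sketch for reading the output, not a re-run of LIME:

```python
# LIME weights for the word 'gifford' under each of the three likely labels,
# copied from the explanations printed above
gifford_weights = {'christian': 0.266, 'graphics': -0.082, 'autos': -0.091}

# The word pushes strongly towards 'christian' and away from the other two
for label, w in sorted(gifford_weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:>9}: {w:+.3f}")
```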

To investigate this further, we can create an annotated visualisation in much the same way as we did for the binary classification example.

In [15]:
exp_600.show_in_notebook(text=True)

Our suspicion was correct! Most of the words that explain this classification are in the header of the message and not in the (rather short) body of the message. Our classifier appears to have learned about Barbara Gifford's messaging interests. This use of message meta-information in the classifier means that it will likely not generalise well to other datasets.

To try and create a classifier that is more generally applicable, we will re-train the Naive Bayes model after removing the headers, footers and quotes from all messages.

3: Retraining without headers, footers and quotes¶

Let's strip away the meta-information contained in headers and footers. We will also get rid of any potentially misleading text in quotes that might lead to a misclassification. Thankfully, all of that tricky text manipulation has already been done for us: we can simply load another version of the dataset that contains only the body of the messages.

In [16]:
newsgroups_train_body = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test_body = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

train_vectors_body = vectorizer.fit_transform(newsgroups_train_body.data)
test_vectors_body = vectorizer.transform(newsgroups_test_body.data)

Now we will train a new Naive Bayes classifier on this data that only contains the body of the messages.

In [17]:
nb_body = MultinomialNB(alpha=.01)
nb_body.fit(train_vectors_body, newsgroups_train_body.target)

pred_body = nb_body.predict(test_vectors_body)
sklearn.metrics.f1_score(newsgroups_test_body.target, pred_body, average='weighted')
Out[17]:
0.6860756544750871

The classifier using only the body of the messages has a much lower F-score. This means that (by this one metric at least) our predicted class allocations are not as good as they were with the meta-data.

Hopefully, this cost comes with the benefit that our classifier now uses reasonable and generalisable features from the text to obtain its predictions. Investigating whether or not this is the case is left to you as a final, open-ended task.


TASK 04¶

Create an explainer to explain the Naive Bayes classifier that does not use meta-information about the texts. Use this to compare the predictions and explanations to those made when using the full text.

Some questions to help guide your exploration:

  • Can you create explanations for the two most likely categories for several test texts?
  • Do these explanations seem sensible to you?
  • Do you think this classifier would generalise well to other texts?

  • How do these predictions and explanations compare to those made with meta-information?

  • (As an extension activity: can you quantify / visualise this comparison?)
  • Why is it difficult to directly compare explanations of texts without meta-information to the explanations that use meta-information?
  • (As an extension activity: can you fix this issue?)

SOLUTION 04¶

In [18]:
# We first make a pipeline and an explainer
c_body = make_pipeline(vectorizer, nb_body)
explainer = LimeTextExplainer(class_names=class_names)

# We can then print the true class and explain the prediction for the test text at index 0
index = 0
print(class_names[newsgroups_test_body.target[index]])
exp_body = explainer.explain_instance(newsgroups_test_body.data[index], c_body.predict_proba, num_features=6, top_labels=2)
exp_body.show_in_notebook()


# And then for the test text at index 150
index = 150
print(class_names[newsgroups_test_body.target[index]])
exp_body = explainer.explain_instance(newsgroups_test_body.data[index], c_body.predict_proba, num_features=6, top_labels=2)
exp_body.show_in_notebook()
autos
space

Comparison to full-text predictions and explanations

  • Predictions are sometimes made based on very little text.
  • The explanatory words are generally more sensible and more likely to generalise beyond this dataset than before.
  • However, texts seem to be classified to topics with much less certainty than before.
  • (Optional extension: can you quantify or visualise this reduction in certainty?)

Why a direct comparison is difficult

  • It is difficult to compare explanations of texts with and without meta-information because the split between the test and training data is different in the two datasets. For example, the text by Barbara Gifford is in the training set when meta-information is removed. (See the code block below.)
  • (Optional Extension: can you alter the existing import code or create a manual partitioning of the data to fix this issue?)
In [19]:
for i, text in enumerate(newsgroups_test.data):
    if "I have been looking for a book" in text:
        print(f"Found at index {i}.")
        print(text)
print('Checked all entries in newsgroups_test.data')

print('\n ---------- \n')

for i, text in enumerate(newsgroups_test_body.data):
    if "I have been looking for a book" in text:
        print(f"Found at index {i}.")
        print(text)
print('Checked all entries in newsgroups_test_body.data')
Found at index 600.
From: gifford@oasys.dt.navy.mil (Barbara Gifford)
Subject: The Mystery in the Paradox
Reply-To: gifford@oasys.dt.navy.mil (Barbara Gifford)
Organization: Carderock Division, NSWC, Bethesda, MD
Lines: 9

I have been looking for a book that specifically addresses
the mystery of God in the paradox.  I have read some that touch
on the subject in a chapter but would like a more detailed read.

Is anyone aware of any books that deal with this subject.

Please e-mail me.  Thanks.

Barbara

Checked all entries in newsgroups_test.data

 ---------- 

Checked all entries in newsgroups_test_body.data
